ABSTRACT

I present a hybrid matrix factorisation model representing
users and items as linear combinations of their content
features’ latent factors. The model outperforms both
collaborative and content-based models in cold-start or sparse
interaction data scenarios (using both user and item metadata),
and performs at least as well as a pure collaborative
matrix factorisation model where interaction data is abundant.
Additionally, feature embeddings produced by the
model encode semantic information in a way reminiscent of
word embedding approaches, making them useful for a range
of related tasks such as tag recommendations.
H.3.3 [Information Storage and Retrieval]: Information
Search and Retrieval - Information Filtering
1. INTRODUCTION

Building recommender systems that perform well in cold-start
scenarios (where little data is available on new users
and items) remains a challenge. The standard matrix
factorisation (MF) model performs poorly in that setting: it is
difficult to effectively estimate user and item latent factors
when collaborative interaction data is sparse.

Content-based (CB) methods address this by representing
items through their metadata [10]. As the metadata are known in
advance, recommendations can be computed even for new
items for which no collaborative data has been gathered.
Unfortunately, no transfer learning occurs in CB models:
models for each user are estimated in isolation and do not
benefit from data on other users. Consequently, CB models
perform worse than MF models where collaborative
information is available and require a large amount of data on
each user, rendering them unsuitable for user cold-start.
At Lyst, solving these problems is crucial. We are a fashion
company aiming to provide our users with a convenient
and engaging way to browse, and shop, for fashion online.
To that end we maintain a very large product catalogue:
at the time of writing, we aggregate over 8 million fashion
items from across the web, adding tens of thousands of new
products every day.

Three factors conspire to make recommendations challenging
for us. Firstly, our system contains a very large
number of items. This makes our data very sparse. Secondly,
we deal in fashion: often, the most relevant items
are those from newly released collections, allowing us only
a short window to gather data and provide effective
recommendations. Finally, a large proportion of our users are
first-time visitors: we would like to present them with compelling
recommendations even with little data. This combination of
user and item cold-start makes both pure collaborative and
content-based methods unsuitable for us.

To solve this problem, I use a hybrid content-collaborative
model, called LightFM due to its resemblance to factorisation
machines (see Section 3). In LightFM, like in a
collaborative filtering model, users and items are represented
as latent vectors (embeddings). However, just as in a CB
model, these are entirely defined by functions (in this case,
linear combinations) of embeddings of the content features
that describe each product or user. For example, if the movie
‘Wizard of Oz’ is described by the following features:
‘musical fantasy’, ‘Judy Garland’, and ‘Wizard of Oz’, then its
latent representation will be given by the sum of these
features’ latent representations.

In doing so, LightFM unites the advantages of content-based
and collaborative recommenders. In this paper, I
formalise the model and present empirical results on two
datasets, showing that:

1. In both cold-start and low-density scenarios, LightFM
performs at least as well as pure content-based models,
substantially outperforming them when either (1)
collaborative information is available in the training set
or (2) user features are included in the model.

2. When collaborative data is abundant (warm-start, dense
user-item matrix), LightFM performs at least as well
as the MF model.

3. Embeddings produced by LightFM encode important
semantic information about features, and can be used
for related recommendation tasks such as tag
recommendations.



This has several benefits for real-world recommender systems.
Because LightFM works well on both dense and sparse
data, it obviates the need for building and maintaining
multiple specialised machine learning models for each setting.
Additionally, as it can use both user and item metadata, it
is applicable in both item and user cold-start scenarios.

To allow others to reproduce the results in this paper, I
have released a Python implementation of LightFM¹ and
made the source code for this paper and all the experiments
available on GitHub².

2. LIGHTFM 

2.1 Motivation 

The structure of the LightFM model is motivated by two 
considerations. 

1. The model must be able to learn user and item
representations from interaction data: if items described as
‘ball gown’ and ‘pencil skirt’ are consistently liked
by the same users, the model must learn that ball gowns are
similar to pencil skirts.

2. The model must be able to compute recommendations 
for new items and users. 

I fulfil the first requirement by using the latent
representation approach. If ball gowns and pencil skirts are both liked
by the same users, their embeddings will be close together;
if ball gowns and biker jackets are never liked by the same
users, their embeddings will be far apart.

Such representations allow transfer learning to occur. If
the representations for ball gowns and pencil skirts are
similar, we can confidently recommend ball gowns to a new user
who has so far only interacted with pencil skirts.

This is over and above what pure CB models using
dimensionality reduction techniques (such as latent semantic
indexing, LSI) can achieve, as these only encode information
given by feature co-occurrence rather than user actions. For
example, suppose that all users who look at items described
as aviators also look at items described as wayfarers, but
the two features never describe the same item. In this case,
the LSI vector for wayfarers will not be similar to the one
for aviators even though collaborative information suggests
it should be.

I fulfil the second requirement by representing items and
users as linear combinations of their content features.
Because content features are known the moment a user or item
enters the system, this allows recommendations to be made
straight away. The resulting structure is also easy to
understand. The representation for denim jacket is simply a
sum of the representation of denim and the representation
of jacket; the representation for a female user from the US
is a sum of the representations of US and female users.

2.2 The Model 

To describe the model formally, let $U$ be the set of users,
$I$ be the set of items, $F^U$ be the set of user features, and
$F^I$ the set of item features. Each user interacts with a number
of items, either in a favourable way (a positive interaction),
or in an unfavourable way (a negative interaction). The set
of all user-item interaction pairs $(u, i) \in U \times I$ is the union
of the positive interactions $S^+$ and the negative interactions $S^-$.

¹ https://github.com/lyst/lightfm/

² https://github.com/lyst/lightfm-paper/

Users and items are fully described by their features. Each
user $u$ is described by a set of features $f_u \subset F^U$. The same
holds for each item $i$ whose features are given by $f_i \subset F^I$.
The features are known in advance and represent user and
item metadata.

The model is parameterised in terms of $d$-dimensional user
and item feature embeddings $e^U_f$ and $e^I_f$ for each feature $f$.
Each feature is also described by a scalar bias term ($b^U_f$ for
user and $b^I_f$ for item features).

The latent representation of user $u$ is given by the sum of
its features’ latent vectors:

$$q_u = \sum_{j \in f_u} e^U_j$$

The same holds for item $i$:

$$p_i = \sum_{j \in f_i} e^I_j$$

The bias term for user $u$ is given by the sum of the features’
biases:

$$b_u = \sum_{j \in f_u} b^U_j$$

The same holds for item $i$:

$$b_i = \sum_{j \in f_i} b^I_j$$

The model’s prediction for user $u$ and item $i$ is then given
by the dot product of user and item representations,
adjusted by user and item feature biases:

$$\hat{r}_{ui} = f(q_u \cdot p_i + b_u + b_i) \qquad (1)$$

A number of functions are suitable for $f(\cdot)$. An identity
function would work well for predicting ratings; in this
paper, I am interested in predicting binary data, and so, following
Rendle et al., I choose the sigmoid function

$$f(x) = \frac{1}{1 + \exp(-x)}.$$
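As a minimal sketch of Equation (1), the snippet below sums feature embeddings and biases to form the user and item representations and then applies the sigmoid link. The feature names, the hand-picked two-dimensional values, and the helper names `represent` and `predict` are hypothetical illustrations, not part of the released LightFM implementation.

```python
import math


def represent(features, embeddings, biases):
    """Return (sum of feature embeddings, sum of feature biases)."""
    dim = len(next(iter(embeddings.values())))
    vector = [0.0] * dim
    bias = 0.0
    for name in features:
        vector = [v + e for v, e in zip(vector, embeddings[name])]
        bias += biases[name]
    return vector, bias


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(user_features, item_features, user_model, item_model):
    """Equation (1): f(q_u . p_i + b_u + b_i) with a sigmoid link."""
    q_u, b_u = represent(user_features, *user_model)
    p_i, b_i = represent(item_features, *item_model)
    dot = sum(q * p for q, p in zip(q_u, p_i))
    return sigmoid(dot + b_u + b_i)


# Hypothetical embeddings and biases, chosen by hand for illustration.
user_model = (
    {"US": [0.1, 0.2], "female": [0.3, -0.1]},  # user feature embeddings
    {"US": 0.0, "female": 0.1},                 # user feature biases
)
item_model = (
    {"denim": [0.5, 0.0], "jacket": [0.2, 0.4]},  # item feature embeddings
    {"denim": -0.1, "jacket": 0.0},               # item feature biases
)

score = predict(["US", "female"], ["denim", "jacket"], user_model, item_model)
```

Because a representation is just a sum over known features, the same `predict` call works for a user or item that was never seen during training, provided its features were.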

The optimisation objective for the model consists in
maximising the likelihood of the data conditional on the
parameters. The likelihood is given by

$$L(e^U, e^I, b^U, b^I) = \prod_{(u,i) \in S^+} \hat{r}_{ui} \times \prod_{(u,i) \in S^-} (1 - \hat{r}_{ui}) \qquad (2)$$

I train the model using asynchronous stochastic gradient
descent. I use four training threads for the experiments
performed in this paper. The per-parameter learning rate
schedule is given by Adagrad.
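The per-parameter schedule can be illustrated with a single-threaded sketch of one Adagrad step on an embedding vector. The function name `adagrad_step` and the `eps` stabiliser are assumptions made for this example; the asynchronous, multi-threaded part of training is deliberately left out.

```python
import math


def adagrad_step(param, grad, sq_grad_sum, learning_rate=0.05, eps=1e-6):
    """Apply one in-place Adagrad update.

    Each coordinate is scaled by the square root of its own accumulated
    squared gradients, so frequently updated parameters take smaller steps.
    """
    for k in range(len(param)):
        sq_grad_sum[k] += grad[k] ** 2
        param[k] -= learning_rate * grad[k] / (math.sqrt(sq_grad_sum[k]) + eps)


# Hypothetical embedding and gradient values.
embedding = [0.5, -0.5]
accumulated = [0.0, 0.0]
adagrad_step(embedding, [1.0, -2.0], accumulated)
```

On the first step every coordinate moves by roughly the base learning rate regardless of gradient magnitude; later steps shrink as the accumulator grows.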

2.3 Relationship to Other Models 

The relationship between LightFM and the collaborative
MF model is governed by the structure of the user and item
feature sets. If the feature sets consist solely of indicator
variables for each user and item, LightFM reduces to the
standard MF model. If the feature sets also contain
metadata features shared by more than one item or user, LightFM
extends the MF model by letting the feature latent factors
explain part of the structure of user interactions.

This is important on three counts. 




1. In most applications there will be fewer metadata
features than there are users or items, either because
an ontology with a fixed type/category structure is
used, or because a fixed-size dictionary of most
common terms is maintained when using raw textual
features. This means that fewer parameters need to be
estimated from limited training data, reducing the risk of
overfitting and improving generalisation performance.

2. Latent vectors for indicator variables cannot be
estimated for new, cold-start users or items. Representing
these as combinations of metadata features that can
be estimated from the training set makes it possible to
make cold-start predictions.

3. If only indicator features are present, LightFM should 
perform on par with the standard MF model. 

When only metadata features and no indicator variables
are present, the model in general does not reduce to a pure
content-based system. LightFM estimates feature
embeddings by factorising the collaborative interaction matrix; this
is unlike content-based systems which (when dimensionality
reduction is used) factorise pure content co-occurrence
matrices.

One special case where LightFM does reduce to a pure 
CB model is where each user is described by an indicator 
variable and has interacted only with one item. In that 
setting, the user vector is equivalent to a document vector in 
the LSI formulation, and only features which occur together 
in product descriptions will have similar embeddings. 

The fact that LightFM contains both the pure CB model 
at the sparse data end of the spectrum and the MF model at 
the dense end suggests that it should adapt well to datasets 
of varying sparsity. In fact, empirical results show that it 
performs at least as well as the appropriate specialised model 
in each scenario. 


3. RELATED WORK 

There are a number of related hybrid models attempting 
to solve the cold-start problem by jointly modelling content 
and collaborative data. 

Soboroff et al. [21] represent users as linear combinations
of the feature vectors of items they have interacted with.
They then perform LSI on the resulting user-feature
matrix to obtain latent user profiles. Representations of new
items are obtained by projecting them onto the latent
feature space. The advantage of the model, relative to pure
CB approaches, consists in using collaborative information
encoded in the user-feature matrix. However, it models user
preferences as being defined over individual features
themselves instead of over items (sets of features). This is unlike
LightFM, where a feature’s effect in predicting an
interaction is always taken in the context of all other features
characterising a given user-item pair.

Saveski et al. [18] perform joint factorisation of the
user-item and item-feature matrices by using the same item latent
feature matrix in both decompositions; the parameters are
optimised by minimising a weighted sum of both matrices’
reproduction loss functions. A weight hyperparameter
governs the relative importance of accuracy in decomposing the
collaborative and content matrices. A similar approach is
used by McAuley et al. [11] for jointly modelling ratings
and product reviews. Here, LightFM has the advantage of
simplicity as its single optimisation objective is to factorise
the user-item matrix.

Shmueli et al. [20] represent items as linear combinations
of their features’ latent factors to recommend news articles;
like LightFM, they use a single-objective approach and
minimise the user-item matrix reproduction loss. They show
their approach to be successful in a modified cold-start
setting, where both metadata and data on other users who have
commented on a given article are available. However, their
approach does not extend to modelling user features, and they do
not provide evidence on model performance in warm-start
scenarios.

LightFM fits into the hybrid model tradition by jointly
factorising the user-item, item-feature, and user-feature
matrices. From a theory standpoint, it can be construed as a
special case of Factorisation Machines (FMs).

FMs provide an efficient method of estimating variable
interaction terms in linear models under sparsity. Each
variable is represented by a k-dimensional latent factor; the
interaction between variables i and j is then given by the dot
product of their latent factors. This has the advantage of
reducing the number of parameters to be estimated.
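For reference, a second-order FM is commonly written as below. The notation (a global bias $w_0$, linear weights $w_i$, latent factors $v_i$, and $n$ input variables $x_i$) is assumed here for illustration and is not used elsewhere in this paper.

$$\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j$$

With $k$-dimensional factors, the pairwise terms need $nk$ parameters rather than one weight for each of the $n(n-1)/2$ variable pairs, which is the parameter saving referred to above.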

LightFM further restricts the interaction structure by only
estimating the interactions between user and item features.
This aids the interpretability of resulting feature
embeddings.

4. DATASETS 

I evaluate LightFM’s performance on two datasets. The 
datasets span the range of dense interaction data, where 
MF models can be expected to perform well (MovieLens), 
and sparse data, where CB models tend to perform better 
(CrossValidated). Both datasets are freely available. 

4.1 MovieLens 

The first experiment uses the well-known MovieLens 10M
dataset³ combined with the Tag Genome tag set [22].

The dataset consists of approximately 10 million movie 
ratings, submitted by 71,567 users on 10,681 movies. All 
movies are described by their genres and a list of tags from 
the Tag Genome. Each movie-tag pair is accompanied by a 
relevance score (between 0 and 1), denoting how accurately 
a given tag describes the movie. 

To binarise the problem, I treat all ratings below 4.0 (on
a 1 to 5 scale) as negative; all ratings equal to or above 4.0
are positive. I also filter out all movie-tag pairs whose
relevance score falls below 0.8, to retain only highly relevant tags.
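The two preprocessing rules above can be sketched as follows. The function names and the example tag scores are hypothetical; only the 4.0 rating threshold and the 0.8 relevance threshold come from the text.

```python
def binarise_rating(rating, threshold=4.0):
    """Map a 1-5 star rating to 1 (positive) or 0 (negative)."""
    return 1 if rating >= threshold else 0


def relevant_tags(tag_scores, min_relevance=0.8):
    """Keep only tags whose relevance score reaches the threshold."""
    return {tag for tag, score in tag_scores.items() if score >= min_relevance}


# Made-up relevance scores for a single movie.
kept = relevant_tags({"dystopia": 0.93, "boring": 0.41, "bond": 0.80})
```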

The final dataset contains 69,878 users, 10,681 items,
9,996,948 interactions, and 1030 unique tags.

4.2 CrossValidated 

The second dataset consists of questions and answers posted
on CrossValidated⁴, a part of the larger network of
StackExchange collaborative Q&A sites that focuses on statistics
and machine learning. The dataset⁵ consists of 5953 users,
44,200 questions, and 188,865 answers and comments. Each
question is accompanied by one or more of 1032 unique tags
(such as ‘regression’ or ‘hypothesis-testing’). Additionally,
user metadata is available in the form of ‘About Me’ sections
on users’ profiles.

³ http://grouplens.org/datasets/movielens/

⁴ http://stats.stackexchange.com

⁵ https://archive.org/details/stackexchange

The recommendation goal is to match users with questions 
they can answer. A user answering a question is taken as 
an implicit positive signal; all questions that a user has not 
answered are treated as implicit negative signals. For the 
training and test sets, I construct 3 negative training pairs 
for each positive user-question pair by randomly sampling 
from all questions that a given user has not answered. 
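A minimal sketch of this negative sampling step for one user is given below. The function name, the fixed default seed, and the choice to sample with replacement are assumptions; the text only specifies three random negatives per positive pair.

```python
import random


def sample_negatives(answered, all_questions, per_positive=3, rng=None):
    """Draw `per_positive` unanswered questions for each answered one.

    Sampling is with replacement (an assumption) and requires at least
    one unanswered question to exist.
    """
    if rng is None:
        rng = random.Random(0)
    candidates = [q for q in all_questions if q not in answered]
    negatives = []
    for _ in answered:
        negatives.extend(rng.choice(candidates) for _ in range(per_positive))
    return negatives


# Hypothetical user who answered questions 1 and 2 out of ten.
negatives = sample_negatives({1, 2}, range(10))
```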

To keep the model simple, I focus on a user’s willingness
to answer a question rather than their ability, and forgo
modelling user expertise [17].


Table 1: Results

                          CrossValidated     MovieLens
                          Warm     Cold      Warm     Cold
LSI-LR                    0.662    0.660     0.686    0.690
LSI-UP                    0.636    0.637     0.687    0.681
MF                        0.541    0.508     0.762    0.500
LightFM (tags)            0.675    0.675     0.744    0.707
LightFM (tags + ids)      0.682    0.674     0.763    0.716
LightFM (tags + about)    0.695    0.696     -        -




5. EXPERIMENTAL SETUP 

For each dataset, I perform two experiments. The first
simulates a warm-start setting: 20% of all interaction pairs
are randomly assigned to the test set, but all items and
users are represented in the training set. The second is an
item cold-start scenario: all interactions pertaining to 20%
of items are removed from the training set and added to
the test set. This approximates a setting where the
recommender is required to make recommendations from a pool of
items for which no collaborative information has been
gathered, and only content metadata (tags) are available.
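The two splitting schemes can be sketched as below, with interactions represented as `(user, item)` pairs. The function names and seeds are hypothetical, and this simplified warm-start split does not enforce the requirement that every user and item also appears in the training set.

```python
import random


def warm_start_split(interactions, test_fraction=0.2, seed=0):
    """Randomly hold out a fraction of interaction pairs."""
    rng = random.Random(seed)
    shuffled = list(interactions)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * test_fraction)
    return shuffled[cut:], shuffled[:cut]  # train, test


def cold_start_split(interactions, test_fraction=0.2, seed=0):
    """Hold out every interaction of a random fraction of items."""
    rng = random.Random(seed)
    items = sorted({item for _, item in interactions})
    rng.shuffle(items)
    held_out = set(items[: int(len(items) * test_fraction)])
    train = [(u, i) for u, i in interactions if i not in held_out]
    test = [(u, i) for u, i in interactions if i in held_out]
    return train, test


# Toy data: 5 users who each interacted with all 10 items.
pairs = [(u, i) for u in range(5) for i in range(10)]
warm_train, warm_test = warm_start_split(pairs)
cold_train, cold_test = cold_start_split(pairs)
```

In the cold-start split no test item ever appears in training, so a model can only score it through its metadata.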

I measure model accuracy using the mean receiver operating
characteristic area under the curve (ROC AUC) metric.
For an individual user, AUC corresponds to the probability
that a randomly chosen positive item will be ranked higher
than a randomly chosen negative item. A high AUC score
is equivalent to a low rank-inversion probability, where the
recommender mistakenly ranks an unattractive item higher
than an attractive item. I compute this metric for all users
in the test set and average it for the final score.
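The pairwise definition of AUC given above translates directly into code. In this sketch, ties count as half a win and users lacking either positives or negatives are skipped; both conventions are assumptions, as are the function names.

```python
def user_auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs ranked in the right order."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half (assumed convention)
    return wins / (len(positive_scores) * len(negative_scores))


def mean_auc(scores_by_user):
    """Average per-user AUC over users with both positives and negatives."""
    aucs = [user_auc(pos, neg) for pos, neg in scores_by_user.values() if pos and neg]
    return sum(aucs) / len(aucs)


# Hypothetical predicted scores: (scores of positives, scores of negatives).
example = {"a": ([0.9, 0.8], [0.1, 0.85]), "b": ([0.2], [0.7])}
result = mean_auc(example)
```

The quadratic loop is fine for illustration; a rank-based computation would be preferable for large test sets.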

I compute the AUC metric by repeatedly randomly
splitting the dataset into an 80% training set and a 20% test set.
The final score is obtained by averaging across 10 repetitions.

I test the following models: 


1. MF: a conventional matrix factorisation model with
user and item biases and a sigmoid link function.

2. LSI-LR: a content-based model. To estimate it, I
first derive latent topics from the item-feature matrix
through latent semantic indexing and represent items
as linear combinations of latent topics. I then fit a
separate logistic regression (LR) model for each user
in the topic mixture space. Unlike the LightFM model,
which uses collaborative data to produce its latent
representation, LSI-LR is purely based on factorising the
content matrix. It should therefore be helpful in
highlighting the benefit of using collaborative information
for constructing feature embeddings.

3. LSI-UP: a hybrid model that represents user profiles
(UP) as linear combinations of items’ content vectors,
then applies LSI to the resulting matrix to obtain
latent user and item representations ([21], see Section 3).
I estimate this model by first constructing a
user-feature matrix: each row represents a user and is given
by the sum of content feature vectors representing the
items that user positively interacted with. I then
apply truncated SVD to the normalised matrix to obtain
user and feature latent vectors; item latent vectors are
obtained through projecting them onto the latent
feature space. The recommendation score for a user-item
pair is then the inner product of their latent
representations.

4. LightFM (tags): the LightFM model using only tag
features.

5. LightFM (tags + ids): the LightFM model using
both tag and item indicator features.

6. LightFM (tags + about): the LightFM model using
both item and user features. User features are
available only for the CrossValidated dataset. I construct
them by converting the ‘About Me’ sections of users’
profiles to a bag-of-words representation. I first strip
them of all HTML tags and non-alphabetical
characters, then convert the resulting string to lowercase and
tokenise on spaces.


In both LightFM (tags) and LightFM (tags + ids) users are
described only by indicator features.
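The bag-of-words construction described for LightFM (tags + about) can be sketched as follows. The regular expressions and the function name are assumptions about one reasonable way to implement the stated steps (strip HTML tags, drop non-alphabetical characters, lowercase, tokenise on spaces).

```python
import re
from collections import Counter


def about_me_to_bag_of_words(html):
    """Convert an 'About Me' HTML snippet to token counts."""
    text = re.sub(r"<[^>]+>", " ", html)     # strip HTML tags
    text = re.sub(r"[^A-Za-z ]", " ", text)  # drop non-alphabetical characters
    return Counter(text.lower().split())     # lowercase, tokenise on spaces


# Hypothetical profile text.
bag = about_me_to_bag_of_words(
    "<p>Bayesian statistician; R and Python user. Bayesian!</p>"
)
```

Each distinct token then acts as one user feature with its own embedding and bias.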

I train the LightFM models using stochastic gradient
descent with an initial learning rate of 0.05. The latent
dimensionality of the models is set to 64 for all models and
experiments. This setting is intended to reflect the balance
between model accuracy and the computational cost of larger
vectors in production systems (additional results on model
sensitivity to this parameter are presented in Section 6.2).
I regularise the model through an early-stopping criterion:
the training is stopped when the model’s performance on
the test set stops improving.


6. EXPERIMENTAL RESULTS 
6.1 Recommendation accuracy 

Experimental results are summarised in Table 1. LightFM
performs very well, outperforming or matching the specialised
model for each scenario.

In the warm-start, low-sparsity case (warm-start
MovieLens), LightFM outperforms MF slightly when using both
tag and item indicator features. This suggests that using
metadata features may be valuable even when abundant
interaction data is present.

Notably, LightFM (tags) almost matches MF performance
despite using only metadata features. The LSI-LR and
LSI-UP models using the same information fare much worse.
This demonstrates that (1) it is crucial to use collaborative
information when estimating content feature embeddings,
and (2) LightFM can capture that information much more
accurately than other hybrid models such as LSI-UP.







In the warm-start, high-sparsity case (warm-start
CrossValidated), MF performs very poorly. Because user
interaction data is sparse (the CrossValidated user-item matrix is
99.95% sparse vs only 99% for the MovieLens dataset), MF is
unable to learn good latent representations. Content-based
models such as LSI-LR perform much better.

LightFM variants provide the best performance. LightFM
(tags + about) is by far the best model, showing the added
advantage of LightFM’s ability to integrate user metadata
embeddings into the recommendation model. This is likely
due to improved prediction performance for users with little
data in the training set.

Results for the cold-start cases are broadly similar. On
the CrossValidated dataset, all variants of LightFM
outperform other models; LightFM (tags + about) again provides
the best performance. Interestingly, LightFM (tags + ids)
outperforms LightFM (tags) slightly on the
MovieLens dataset, even though no embeddings can be estimated
for movies in the test set. This suggests that using both
metadata and per-movie features allows the model to
estimate better embeddings for both, much like the use of user
and item bias terms allows better latent factors to be
computed. Unsurprisingly, MF performs no better than random
in the cold-start case.

Despite its attempt to incorporate collaborative data, the
LSI-UP model offers no meaningful improvement over the
LSI-LR model in any scenario: its only lead (0.687 vs 0.686
on warm-start MovieLens) is negligible. On the CrossValidated
dataset it performs strictly worse. This might be because its
latent representations are estimated on less data than in
LSI-LR: as there are fewer users than items in the dataset,
there are fewer rows in the user-feature matrix than in the
item-feature matrix.

The results confirm that LightFM encompasses both the
MF and the LSI-LR model as special cases, performing
better than the LSI-LR model in the sparse-data scenario and
better than the MF model in the dense-data case. This
means not only that a single model can be maintained in
both settings, but also that the model will continue to
perform well even when the sparsity structure of the data
changes.

Good performance of LightFM (tags) in both datasets
is predicated on the availability of high-quality metadata.
Nevertheless, it is often possible to obtain good quality
metadata from item descriptions (genres, actor lists and so on),
expert or community tagging (Pandora [23], StackOverflow),
or computer vision systems where image or audio data is
available (we use image-based convolutional neural networks
for product tagging). In fact, the feature embeddings
produced by LightFM can themselves be used to assist the
tagging process by suggesting related tags.

6.2 Parameter Sensitivity 

Figure 1 plots the accuracy of LightFM, LSI-LR, and
LSI-UP against values of the latent dimensionality
hyperparameter d in the cold-start scenario (averaged over 30 runs of
each algorithm). As d increases, each model is capable of
modelling more complex structures and achieves better
performance.

Interestingly, LightFM performs very well even with a
small number of dimensions. In both datasets LightFM
consistently outperforms other models, achieving high
performance with as few as 16 dimensions. On CrossValidated
data, it achieves the same performance as the LSI-LR model
for much smaller d: it matches the accuracy of the
512-dimensional LSI-LR model even when using fewer than 32
dimensions.

Table 2: Tag similarity

Query tag       Similar tags
‘regression’    ‘least squares’, ‘multiple regression’, ‘regression coefficients’, ‘multicollinearity’
‘MCMC’          ‘BUGS’, ‘Metropolis-Hastings’, ‘Beta-Binomial’, ‘Gibbs’, ‘Bayesian’
‘survival’      ‘epidemiology’, ‘Cox model’, ‘Kaplan-Meier’, ‘hazard’
‘art house’     ‘pretentious’, ‘boring’, ‘graphic novel’, ‘pointless’, ‘weird’
‘dystopia’      ‘post-apocalyptic’, ‘futuristic’, ‘artificial intelligence’
‘bond’          ‘007’, ‘secret service’, ‘nuclear bomb’, ‘spying’, ‘assassin’

This is an important win for large-scale recommender
systems, where the choice of d is governed by a trade-off
between vector size and recommendation accuracy. Since smaller
vectors occupy less memory and use fewer computations
at query time, better representational power at small d
allows the system to achieve the same model performance at
a smaller computational cost.

6.3 Tag embeddings 

Feature embeddings generated by the LightFM model
capture important information about the semantic relationships
between different features. Table 2 gives some examples by
listing groups of tags similar (in the cosine similarity sense)
to a given query tag.
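A tag lookup of this kind can be sketched with plain cosine similarity over the learned tag embeddings. The two-dimensional vectors below are made up for illustration and are not taken from a trained model; the function names are likewise hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


def most_similar(query, embeddings, top_n=3):
    """Return the `top_n` tags closest to `query` by cosine similarity."""
    scored = [
        (cosine(embeddings[query], vector), tag)
        for tag, vector in embeddings.items()
        if tag != query
    ]
    return [tag for _, tag in sorted(scored, reverse=True)[:top_n]]


# Made-up embeddings standing in for learned tag vectors.
tag_embeddings = {
    "regression": [1.0, 0.1],
    "least squares": [0.9, 0.2],
    "MCMC": [0.1, 1.0],
    "Gibbs": [0.0, 0.9],
}
neighbours = most_similar("regression", tag_embeddings, top_n=2)
```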

In this respect, LightFM is similar to recent word
embedding approaches like word2vec and GloVe [12, 13]. This
is perhaps unsurprising, given that word embedding
techniques are closely related to forms of matrix factorisation.
Nevertheless, LightFM and word embeddings differ in
one important respect: whilst word2vec and GloVe
embeddings are driven by textual corpus co-occurrence statistics,
LightFM is based on user interaction data.

LightFM embeddings are useful for a number of recom¬ 
mendation tasks. 

1. Tag recommendation. Various applications use
collaborative tagging as a way of generating richer
metadata for use in search and recommender systems.
A tag recommender can enhance this process by either
automatically applying matching tags, or generating
suggested tag lists for approval by users.
LightFM-produced tag embeddings will work well for this task
without the need to build a separate specialised model
for tag recommendations.

2. Genre or category recommendation. Many
domains are characterised by an ontology of genres or
categories which play an important role in the
presentation of recommendations. For example, the Netflix
interface is organised in genre rows; for Lyst, fashion
designers, categories and subcategories are
fundamental. The degree of similarity between the embeddings
of genres or categories provides a ready basis for genre
or category recommendations that respect the
semantic structure of the ontology.

3. Recommendation justification. Rich information
encoded in feature embeddings can help provide
explanations for recommendations made by the system. For
example, we might recommend a ball gown to a user
who likes pencil skirts, and justify it by the two
features’ similarity as revealed by the distance between
their latent factors.

Figure 1: Latent dimension sensitivity. (a) CrossValidated (b) MovieLens

7. USAGE IN PRODUCTION SYSTEMS 

The LightFM approach is motivated by our experience 
at Lyst. We have deployed LightFM in production, and 
successfully use it for a number of recommendation tasks. In 
this section, I describe some of the engineering and algorithm 
choices that make this possible. 

7.1 Model training and fold-in 

Thousands of new items and users appear on Lyst every 
day. To cope with this, we train our LightFM model in 
an online manner, continually updating the representations 
of existing features and creating fresh representations for 
features that we have never observed before. 

We store model state, including feature embeddings and
accumulated squared gradient information, in a database.
When new data on user interaction arrives, we restore the
model state and resume training, folding in any newly
observed features. Since our implementation uses per-parameter
diminishing learning rates (Adagrad), any updates of
established features will be incremental as the model adapts to
new data. For new features, a high learning rate is used to
allow useful embeddings to be learned as quickly as possible.

No re-training is necessary for folding in new products: 
their representation can be immediately computed as the 
sum of the representations of their features. 


7.2 Feature engineering 

Each of our products is described by a set of textual
features as well as structured metadata such as its type (dress,
shoes and so on) or designer. These are accompanied by
additional features coming from two sources.

Firstly, we employ a team of experienced fashion
moderators, helping us to derive more fine-grained features such as
clothing categories and subcategories (peplum dress,
halter-neck and so on).

Secondly, we use machine learning systems for automatic 
feature detection. The most important of these is a set 
of deep convolutional neural networks deriving feature tags 
from product image data. 

7.3 Approximate nearest neighbour searches 

The biggest application of LightFM-derived item
representations is related product recommendations: given a
product, we would like to recommend other highly relevant
products. To do this efficiently across 8 million products, we
use a combination of approximate (for on-demand
recommendations) and exact (for near-line computation) nearest
neighbour search.

For approximate nearest neighbour (ANN) queries, we use
Random Projection (RP) trees. RP trees are a
variant of random-projection based locality sensitive hashing
(LSH).

In LSH, $k$-bit hash codes for each point $x$ are generated
by drawing $k$ random hyperplanes $v_1, \dots, v_k$, and then setting the $j$-th
bit of the hash code to 1 if $x \cdot v_j > 0$ and 0 otherwise. The
approximate nearest neighbours of $x$ are then other points
that share the same hash code (or whose hash codes are
within some small Hamming distance of each other).
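The hashing scheme just described can be sketched as follows. Drawing hyperplane normals from a standard Gaussian, the function names, and the example vector are assumptions for illustration; this is not the code used in production at Lyst.

```python
import random


def random_hyperplanes(num_bits, dim, seed=0):
    """Draw one random hyperplane normal per hash bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def hash_code(x, hyperplanes):
    """Bit j is 1 if x lies on the positive side of hyperplane j."""
    return tuple(
        1 if sum(a * b for a, b in zip(x, v)) > 0 else 0 for v in hyperplanes
    )


def hamming(a, b):
    """Number of positions at which two hash codes differ."""
    return sum(p != q for p, q in zip(a, b))


planes = random_hyperplanes(num_bits=8, dim=4)
point = [1.0, 2.0, -0.5, 0.3]  # hypothetical item embedding
code = hash_code(point, planes)
```

Because each bit depends only on the sign of a dot product, rescaling a vector by a positive constant leaves its code unchanged, so the codes capture direction rather than magnitude.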

While extremely fast, LSH has the undesirable property
of sometimes producing a highly unbalanced distribution
of points across hash codes: if points are densely
concentrated, many hash codes will apply to no products
while some will describe a very large number of points. This
is unacceptable when building a production system, as it
will lead to many queries being very slow.

RP trees provide much better guarantees about the size
of leaf nodes: at each internal node, points are split based
on the median distance to the chosen random hyperplane.
This guarantees that at every split approximately half the
points will be allocated to each child, making the distribution
of points (and query performance) much more predictable.
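A minimal sketch of median-split RP tree construction and lookup is shown below. The dictionary-based tree layout, the `leaf_size` parameter, and the function names are assumptions made for this example and do not describe any particular library.

```python
import random


def build_rp_tree(points, leaf_size=2, rng=None):
    """Recursively split points at the median of a random projection."""
    if rng is None:
        rng = random.Random(0)
    if len(points) <= leaf_size:
        return {"leaf": points}
    direction = [rng.gauss(0.0, 1.0) for _ in range(len(points[0]))]
    # Sort points by their projection onto the random direction.
    projected = sorted(
        (sum(a * b for a, b in zip(p, direction)), idx)
        for idx, p in enumerate(points)
    )
    mid = len(points) // 2
    left = [points[idx] for _, idx in projected[:mid]]
    right = [points[idx] for _, idx in projected[mid:]]
    return {
        "direction": direction,
        "median": projected[mid][0],
        "left": build_rp_tree(left, leaf_size, rng),
        "right": build_rp_tree(right, leaf_size, rng),
    }


def query_leaf(tree, x):
    """Descend to the leaf whose region contains x."""
    while "leaf" not in tree:
        projection = sum(a * b for a, b in zip(x, tree["direction"]))
        tree = tree["left"] if projection < tree["median"] else tree["right"]
    return tree["leaf"]


# Hypothetical item embeddings.
data_rng = random.Random(1)
items = [[data_rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(16)]
tree = build_rp_tree(items, leaf_size=2)
candidates = query_leaf(tree, items[0])
```

Splitting by rank rather than by a fixed threshold is what keeps the leaves balanced: every split sends half of the points (give or take one) to each side, however the points are distributed.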

8. CONCLUSIONS AND FUTURE WORK 

In this paper, I have presented an effective hybrid
recommender model dubbed LightFM. I have shown the following:

1. LightFM performs at least as well as a specialised
model across a wide range of collaborative data
sparsity scenarios. It outperforms existing content-based
and hybrid models in cold-start scenarios where
collaborative data is available in the training set or where
user metadata is available.

2. It produces high-quality content feature embeddings 
that capture important semantic information about 
the problem domain, and can be used for related tasks 
such as tag recommendations. 

Both properties make LightFM an attractive model,
applicable both in cold- and warm-start settings. Nevertheless,
I see two promising directions in extending the current
approach.

Firstly, the model can be easily extended to use more
sophisticated training methodologies. For example, an
optimisation scheme using Weighted Approximate-Rank Pairwise
loss, or one directly optimising mean reciprocal rank, could
be used.

Secondly, there is no easy way of incorporating visual or 
audio features in the present formulation of LightFM. At 
Lyst, we use a two-step process to address this: we first 
use convolutional neural networks (CNNs) on image data 
to generate binary tags for all products, and then use the 
tags for generating recommendations. We conjecture that 
substantial improvements could be realised if the CNNs were 
trained with recommendation loss directly. 